Raoul Grouls
A probability distribution is:
a mathematical description
of a random phenomenon
in terms of all its possible outcomes
and their associated probabilities
In this image, what do you expect to be:
The main types of distributions are:
Every horizontal line you draw can be interpreted as a continuous distribution. Every barplot as a discrete distribution.
All the distributions we are going to discuss are variations of these two basic types!
For parametric distributions, we have a formula that describes the line / bars. You just put in the parameters, and the output is the line / bars.
A probability mass function (pmf) describes the probability distribution of discrete variables.
Consider a coin toss:
\[ f(x) = \begin{cases} 0.5 & x \text{ is head} \\ 0.5 & x \text{ is tails} \end{cases} \]
This is the pmf of the Bernoulli distribution
The probability is a function \(f\) over the sample space \(\mathscr{S}\) of a discrete random variable \(X\), which gives the probability that \(X\) is equal to a certain value. \[f(x) = P(X = x)\]
Each pmf satisfies these conditions: \[ \begin{align} f(x) \geq 0 \,\, \forall x \in \mathscr{S}\\ \sum_{x \in \mathscr{S}} f(x) = 1 \end{align} \]
For a collection \(\mathscr{A} \subseteq \mathscr{S}\): \[P(X \in \mathscr{A}) = \sum_{t \in \mathscr{A}} f(t)\]
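The pmf conditions above can be checked directly in code. A minimal sketch, using the fair coin as the sample space:

```python
# A pmf as a plain mapping from outcomes to probabilities (fair coin).
pmf = {"heads": 0.5, "tails": 0.5}

# Condition 1: every probability is non-negative.
assert all(p >= 0 for p in pmf.values())

# Condition 2: the probabilities sum to 1 over the sample space.
assert abs(sum(pmf.values()) - 1.0) < 1e-12

def prob(collection):
    """P(X in A): sum the pmf over the elements of the collection A."""
    return sum(pmf[outcome] for outcome in collection)

print(prob({"heads"}))           # 0.5
print(prob({"heads", "tails"}))  # 1.0
```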
For continuous distributions, we use a probability density function (pdf).
This might look like unnecessary mathematical details. But it is actually important to understand the difference.
Example: can you answer the question “What is the probability your body temperature is 37.0 C?”
The answer might be unexpected: 0!
Let’s say your answer is 25%. But what if your temperature is 37.1 C? Does that count? Or 37.01?
Because the distribution is continuous you can only say something about the range “What is the probability your temperature is between 36.5 and 37.2 C?”
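This point-vs-interval difference is easy to see in code. A sketch using the standard library's `NormalDist`; the model (body temperature as Normal with mean 36.8 and standard deviation 0.4) is an illustrative assumption, not a medical fact:

```python
from statistics import NormalDist

# Assumed model (illustrative only): body temperature ~ Normal(36.8, 0.4)
temp = NormalDist(mu=36.8, sigma=0.4)

# A single point has probability 0: the cdf does not jump there.
print(temp.cdf(37.0) - temp.cdf(37.0))  # 0.0

# An interval does carry probability: P(36.5 <= T <= 37.2) = F(37.2) - F(36.5)
p_interval = temp.cdf(37.2) - temp.cdf(36.5)
print(round(p_interval, 3))
```

Note that `temp.pdf(37.0)` would return a *density*, not a probability; only integrating the pdf over a range gives a probability.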
We will look at three different discrete distributions.
The simplest case for discrete is a barplot with just two options.
We call this the “Bernoulli distribution”, named after Jacob Bernoulli (1655-1705). He also discovered the constant \(e\) and published work on infinite series.
The pmf is: \[ f(x) = \begin{cases} p & x \text{ is head} \\ 1-p & x \text{ is tails} \end{cases} \]
For the discrete uniform distribution over \(n\) equally likely outcomes, the pmf is: \[ f(x) = \frac{1}{n}\]
Question: think about the calls a callcenter receives. When can you use a Poisson?
You will need a constant rate! While 9h-17h is a fixed interval of time, the rate is probably not constant.
But maybe 9h-12h is! You can use a separate Poisson distribution for each timeframe!
There is one parameter \(\lambda \in (0, \infty)\)
The pmf is: \[ f(k) = \frac{\lambda^k e^{-\lambda}}{k!}\]
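The Poisson pmf is easy to compute directly from the formula. A sketch, with an assumed (illustrative) call centre rate of \(\lambda = 3\) calls per hour:

```python
from math import exp, factorial

def poisson_pmf(k, lam):
    """Poisson pmf: lambda^k * e^(-lambda) / k!"""
    return lam**k * exp(-lam) / factorial(k)

lam = 3.0  # assumed rate: 3 calls per hour (illustrative)

# Probability of exactly 3 calls in one hour.
print(round(poisson_pmf(3, lam), 3))  # 0.224

# The pmf sums to (approximately) 1 over k = 0, 1, 2, ...
probs = [poisson_pmf(k, lam) for k in range(20)]
print(round(sum(probs), 3))
```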
This is one of the distributions that is used most often.
A major reason for this is the central limit theorem: if you keep sampling from a population and adding the values, the sum tends towards a normal distribution.
Take a person’s height.
Thus, height will be approximately normally distributed. So will the weight of wolves or the length of a penguin’s wing.
However, multiplying values will give you a long tail!
This is the case when variables interact in some way, and are not independent.
\[4 + 4 + 4 + 4 = 16\]
but
\[4 * 4 * 4 * 4 = 256\]
This is common if variables interact with each other. Examples are stock prices, failures of machines, ping times on a network, income distribution.
Multiplying values will give you a fat-tailed distribution! This will typically be a log-normal distribution:
if \(X\) is log-normally distributed, then \(Y = \log(X)\) follows a normal distribution.
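A quick simulation makes this concrete. Multiplying many positive factors (e.g. daily relative changes) gives a right-skewed result, and taking the log brings it back to something symmetric. The factor range is an illustrative assumption:

```python
import math
import random
import statistics

random.seed(42)

def product_of_factors(n=50):
    """Multiply n independent positive factors, e.g. relative changes around 1.0."""
    value = 1.0
    for _ in range(n):
        value *= random.uniform(0.8, 1.25)  # assumed factor range (illustrative)
    return value

samples = [product_of_factors() for _ in range(10_000)]
logs = [math.log(s) for s in samples]

# The products are right-skewed: the mean sits well above the median ...
print(statistics.fmean(samples) > statistics.median(samples))  # True
# ... but their logs are roughly symmetric: mean is close to the median.
print(abs(statistics.fmean(logs) - statistics.median(logs)) < 0.1)
```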
The normal distribution has two parameters:
The most basic form of the pdf is \[f(x) = e^{-x^2}\]
The full pdf is:
\[f(x; \mu, \sigma) = \frac{1}{\sigma\sqrt{2 \pi}}e^{-\frac{1}{2}(\frac{x-\mu}{\sigma})^2}\]
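The full pdf can be written down directly from the formula and checked against the standard library. The parameter values below (heights in metres) are illustrative assumptions:

```python
from math import exp, pi, sqrt
from statistics import NormalDist

def normal_pdf(x, mu, sigma):
    """The normal pdf: (1 / (sigma * sqrt(2*pi))) * exp(-0.5 * ((x - mu)/sigma)^2)"""
    return 1 / (sigma * sqrt(2 * pi)) * exp(-0.5 * ((x - mu) / sigma) ** 2)

mu, sigma = 1.70, 0.10  # assumed: heights in metres (illustrative)
x = 1.80

# Our hand-written pdf matches the library implementation.
print(abs(normal_pdf(x, mu, sigma) - NormalDist(mu, sigma).pdf(x)) < 1e-12)  # True
```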
The exponential distribution is connected to the Poisson distribution.
You can use it to model the time between independent events in a Poisson process
The assumption of a constant rate is rarely satisfied, but for our models it is often a good enough approximation.
the pdf: \[f(x, \lambda) = \lambda e^{-\lambda x}\]
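The link with the Poisson process can be sketched by simulation: with a constant rate \(\lambda\), the waiting times between events are exponentially distributed with mean \(1/\lambda\). The rate below is an illustrative assumption:

```python
import random
import statistics

random.seed(1)
lam = 3.0  # assumed rate: 3 calls per hour (illustrative)

# random.expovariate samples from the pdf f(x) = lambda * e^(-lambda * x)
waits = [random.expovariate(lam) for _ in range(100_000)]

# The mean waiting time between events is 1 / lambda.
print(round(statistics.fmean(waits), 2))  # ≈ 0.33 hours between calls
```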
The beta distribution describes the probability of probabilities
It is typically used to describe the evolution of an informed belief.
Let’s say you want to determine if a coin is fair. The probability should be somewhere between 0.0 and 1.0, that’s all you can say at first.
The range must always be between 0.0 and 1.0!
This is represented by \(Beta(1,1)\) (a continuous uniform distribution!)
If you start counting heads and tails, your belief will change over time
The evolution could look like this:
If you keep tossing, you will get more certain.
There are two parameters, \(\alpha\) and \(\beta\). The easiest way to interpret them is as the amount of coinflips head vs tails.
The pdf is a bit complex because it involves another function \(\Gamma\) :
\[f(x; \alpha, \beta) = \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha)\Gamma(\beta)} x^{\alpha - 1}(1-x)^{\beta - 1}\]
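The belief update itself needs no special machinery: starting from \(Beta(1,1)\), each observed head increments \(\alpha\) and each tail increments \(\beta\). A sketch with an assumed (illustrative) sequence of flips:

```python
# Start from Beta(1, 1): "anything between 0 and 1 is equally likely".
alpha, beta = 1, 1

# Assumed observations (illustrative): 7 heads, 3 tails.
flips = ["H", "T", "H", "H", "T", "H", "H", "T", "H", "H"]
for flip in flips:
    if flip == "H":
        alpha += 1   # each head adds one to alpha
    else:
        beta += 1    # each tail adds one to beta

# The mean of Beta(alpha, beta) is alpha / (alpha + beta):
# our current best guess for P(heads).
print(alpha, beta)                       # 8 4
print(round(alpha / (alpha + beta), 2))  # 0.67
```

As more flips come in, \(\alpha + \beta\) grows and the distribution narrows, matching the slide's point that more tosses make you more certain.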
People think in terms of stories - thus the unreasonable power of the anecdote to drive decision-making.1
But existing analytics often fails to provide this kind of story. Instead, numbers seemingly appear out of thin air.1
Probabilistic programming will unlock narrative explanations of data, one of the holy grails of business analytics and scientific persuasion.1
It is important to discern between the two! There are limits to what you can do with your models!
A model has parameters.
Let’s say we use a simple linear model:
\[f(x)= ax + b\]
We could use data to estimate the optimal value for \(a\) and \(b\). Let’s say we find \(a=1.0\) and \(b=5.0\).
But how certain are we about our output?
Could it be \(b=5.1\)? Or even \(b=5.5\)? Or \(b=6.0\)?
We can describe our parameters as distributions.
This way we could even model events with small datasets!
E.g.
\[a \sim \mathscr{N}(\mu, \sigma)\]
Meaning: the parameter \(a\) follows a normal distribution with mean \(\mu\) and standard deviation \(\sigma\).
We can now express our uncertainty by saying: The value of \(a\) has a mean \(\mu=1.0\) and standard deviation \(\sigma=0.2\).
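This uncertainty is easy to propagate by simulation: draw many plausible values of \(a\) and look at the spread of predictions. A sketch, with \(b\) held fixed at 5.0 for simplicity (an assumption for illustration):

```python
import random
import statistics

random.seed(0)

b = 5.0    # treated as fixed here, for simplicity
x = 10.0   # the input at which we predict

# a ~ Normal(1.0, 0.2): draw many plausible slopes and compute f(x) = a*x + b.
preds = [random.gauss(1.0, 0.2) * x + b for _ in range(10_000)]

print(round(statistics.fmean(preds), 1))  # centred near 1.0*10 + 5 = 15
print(round(statistics.stdev(preds), 1))  # spread ≈ 0.2*10 = 2
```

The spread of `preds` is exactly the kind of band the shaded confidence interval visualises.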
The shaded area is the 99% confidence interval.
Raise your hand if you think it is reasonable to say that increasing variable \(x\) will lower variable \(y\)
Now I tell you that the \(x\) axis is the amount of hours invested in study, and the \(y\) axis is the average grade of a student.
Raise your hand if you think it is reasonable to say that increasing variable \(x\) will lower variable \(y\)